[js/webgpu] Optimize Gather op #17625

qjia7 · 2023-09-20T06:02:51Z

Description

This PR optimizes the Gather op, which saves ~6 ms in the Segment Anything model on ADL.
The problem with the original algorithm is that it uses a for loop to process a whole block of data. However, the block size can be very large, e.g. 65536. In a GPU shader, we should avoid long loops and instead use more threads to do the work in parallel.

Before:

[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 6886207 ns

After:

[profiling] kernel "41771992|[Gather] 41771992" input[0]: [4,65536] | float32, input[1]: [1] | int64, output[0]: [1,65536] | float32, execution time: 11719 ns
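
To illustrate the idea described above, here is a minimal CPU-side sketch in TypeScript that models the two indexing strategies. It is not the shader code from this PR: the names `GatherShape`, `gatherPerBlock`, and `gatherPerElement` are hypothetical, and each iteration of the outer loop over `gid` stands in for one GPU invocation.

```typescript
// Hypothetical CPU model of two Gather indexing strategies (not the PR's shader).
// Gather along one axis: output[o, i, b] = data[o, indices[i], b].

interface GatherShape {
  outer: number;   // product of dims before the gather axis
  axisDim: number; // size of the gather axis in the input
  block: number;   // product of dims after the gather axis
}

interface GatherResult {
  output: Float32Array;
  invocations: number; // how many simulated GPU invocations were needed
}

// Negative indices count from the end of the axis, as in ONNX Gather.
function normalizeIndex(idx: number, axisDim: number): number {
  return idx < 0 ? idx + axisDim : idx;
}

// Strategy A: one invocation per (outer, index) pair; each one loops over the block.
function gatherPerBlock(data: Float32Array, indices: number[], s: GatherShape): GatherResult {
  const n = indices.length;
  const output = new Float32Array(s.outer * n * s.block);
  const invocations = s.outer * n;
  for (let gid = 0; gid < invocations; gid++) {
    const o = Math.floor(gid / n);
    const i = gid % n;
    const src = (o * s.axisDim + normalizeIndex(indices[i], s.axisDim)) * s.block;
    const dst = gid * s.block;
    // Long serial loop inside a single invocation when `block` is large.
    for (let b = 0; b < s.block; b++) {
      output[dst + b] = data[src + b];
    }
  }
  return { output, invocations };
}

// Strategy B: one invocation per output element; no inner loop.
function gatherPerElement(data: Float32Array, indices: number[], s: GatherShape): GatherResult {
  const n = indices.length;
  const total = s.outer * n * s.block;
  const output = new Float32Array(total);
  for (let gid = 0; gid < total; gid++) {
    // Decompose the flat output offset into (o, i, b).
    const b = gid % s.block;
    const i = Math.floor(gid / s.block) % n;
    const o = Math.floor(gid / (s.block * n));
    const src = (o * s.axisDim + normalizeIndex(indices[i], s.axisDim)) * s.block + b;
    output[gid] = data[src];
  }
  return { output, invocations: total };
}
```

For the shapes in the profiling log (input `[4,65536]`, one index, gathering along axis 0), strategy A in this model would use a single invocation that loops 65536 times, while strategy B would use 65536 invocations that each copy one element.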

qjia7 · 2023-09-20T06:06:25Z

@guschmue @fs-eire @satyajandhyala Please take a look, thanks.

qjia7 · 2023-09-20T06:48:52Z

cc @dakenf

dakenf · 2023-09-20T13:10:28Z

LGTM

fs-eire · 2023-09-22T00:54:18Z

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

fs-eire · 2023-09-22T00:54:20Z

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline

azure-pipelines · 2023-09-22T00:54:36Z

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines · 2023-09-22T00:54:36Z

Azure Pipelines successfully started running 2 pipeline(s).


[js/webgpu] Optimize gather

27c7708

fs-eire approved these changes Sep 22, 2023


fs-eire merged commit 891fba3 into microsoft:main Sep 22, 2023
